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Abstract We compare the performance of two very different parallel gravitational N- 
body codes for astrophysical simulations on large GPU clusters, both pioneer in their 
own fields as well as in certain mutual scales - NBODY6++ and Bonsai. We carry out 
the benchmark of the two codes by analyzing their performance, accuracy and efficiency 
through the modeling of structure decomposition and timing measurements. We find that 
both codes are heavily optimized to leverage the computational potential of GPUs as their 
performance has approached half of the maximum single precision performance of the 
underlying GPU cards. With such performance we predict that a speed-up of 200—300 can 
be achieved when up to Ik processors and GPUs are employed simultaneously. We discuss 
the quantitative information about comparisons of two codes, finding that in the same 
cases Bonsai adopts larger time steps as well as relative energy errors than NBODY6++, 
typically ranging from 10 — 50 times larger, depending on the chosen parameters of the 
codes. While the two codes are built for different astrophysical applications, in specified 
conditions they may overlap in performance at certain physical scale, and thus allowing 
the user to choose from either one with finetuned parameters accordingly. 

Key words: methods: analytical — methods: data analysis — methods: numerical 


1 INTRODUCTION 

Algorithms for gravitational A-body simulations, which are widely used tools in astrophysics nowadays, 
have mainly evolved into two categories over the passed decades. Traditionally, by computing the pair¬ 
wise force among particles, the direct summation method has been employed as the core idea of the so- 
called “direct A-body code”. High accuracy can be archived by choosing smaller time steps, with higher 
computational costs. Some best-known examples are the NBODY series codes developed by Aarseth 
(1999) and the Star lab environment developed by Hut, McMillan, Makino, Portegies Zwart, etc 
(Hut 2003). Alternatively, the force calculation can be approximated with certain assumptions. In the 
late 1960s, some approximation algorithm such as tree-code or mesh-code are developed in attempt to 
reduce the computational complexity so that larger simulations can be scaled on limited hardware with 
acceptable time. One of the most prominent approximated algorithm, namely Barnes-Hut tree (Barnes & 
Hut 1986), also have many implementations such as GADGET developed by Springel (2005) and PEPC 
developed by Gibbon, Winkel and collaborators (Winkel et al. 2012). 
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Since direct summation methods use an all-to-all particle force direct summation method they have 
a raw computational complexity 0{N‘^). Additional algorithms are developed to cut the absolute wall- 
clock time down in spite of the prohibitive asymptotic complexity. On the other side approximate 
schemes reduce complexities to 0{N \ogN) or even 0{N) thanks to the approximate treatments in 
force computations and some special structures such as octree or grids. However, such intelligent ap¬ 
proximative algorithms may not be very suitable for the simulation of certain astronomical systems such 
as dense star clusters, e.g. globular clusters or nuclear star clusters with or without central massive black 
holes. This is because in these systems two-body relaxation is important, which can only be correctly 
modeled by following pairwise particle interactions with high precision at large distances. They require 
the use of direct summation methods, which has so far great difficulties reaching even one million parti¬ 
cles (but see Wang et al. 2015). As such, the only practical approach at the present to handle simulations 
with ultra-high particle numbers (e.g. cosmological structure formation) is through the employment of 
approximate methods, despite their lack of resolution at small scales (Shin et al. 2014; Genel et al. 2014; 
Vogelsberger et al. 2014). 

Consequently, parallel technologies are applied for A-body simulations as a proper solution. With 
the help of parallel hardware, simulations could be accelerated multifold in accordance with the num¬ 
bers of invoking processors theoretically. In practice parallel schemes were implemented, as well as 
performance analysis of parallel A-body codes on supercomputers or distributed systems, such as the 
work by Gualandris et al. (2007). Supercomputer clusters were used for the parallelization of A-body 
simulations, special devices were also added in order to process the hot computation sections. In the 
beginning, special-purpose architectures called GRAPE series (Makino et al. 2003) are designed for A- 
body simulations exclusively, which achieved speed-ups by putting the whole force calculations into the 
hardware that placing many pipelines on one chip, detailed performance of GRAPE measured by Harfst 
et al. (2007). In recent years, with the rapid development of hardware manufacturing technique the GPU 
(Graphics Processing Unit) as general-purpose devices are more and more used and acted the same role 
as GRAPE. Now architectures consisted of many processors and equipped with corresponding GPUs 
are prevalent solutions in A-body simulations. 

Parallel A-body simulation software running partially or even entirely on GPUs were developed 
subsequently (Berczik et al. 2011, 2013; Spurzem et al. 2012; Bedorf et al. 2012a, b), while in practice 
applications the performance would not measure up to ideal speed-up because of some inevitable serial 
code in the code structure. The actual speed-up is limited by sequential fractions in codes and not di¬ 
rectly proportional to the number of processor cores, the theoretical maximum value could be predicted 
hy Amdahl’s law (Amdahl 1967). What’s more, this peak value is unapproachable on account of com¬ 
munication overhead between multiple processors. The effectiveness of either parallelization or GPU 
acceleration introduced in A-body software is not intuitive but interested. 

In this paper, we focus on the performance analysis of two kinds of A-body software, direct A-body 
code NBODY6-I--I- and tree-code Bonsai , which both can be executed in parallel and accelerated by 
GPUs. Section 2 describes an overview of the software and hardware we used. Section 3 describes the 
performance models used to analyse complicated A-body codes, provides the detailed measurements, 
performance results and reasonable predictions. Section 4 describes the performance comparisons and 
analysis, then makes a conclusion which gives us a better reference in the choice of opportune scheme 
of software type and hardware scale in A-body simulations. 

2 SOFTWARE AND HARDWARE 

2.1 Direct A-body implementation: NBODY6-H- 

In this section we provide a brief description of NBODY6++ , which we used for the performance 
analysis of direct A-body code. 

NBODY6++ developed by Rainer Spurzem is a parallel version of NBODY6 (Spurzem 1999; Khalisi 
et al. 2003; Spurzem et al. 2008). The standard NBODY6 is the 6*^ generation of NBODY code initi¬ 
ated by Sverre Aarseth, who has a lifelong dedication to the development of the family of NBODY series 
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codes (Aarseth 1999). The first code NBODYl was a basic direct A-body code with individual time steps. 
Ahmad-Cohen neighbour scheme (Ahmad & Cohen 1973) was used in NB0DY2 and NB0DY5 made it 
possible to treat larger systems. Kustaanheimo-Stiefel (1965) two-body regularization and chain regu¬ 
larization were applied in NB0DY3 and NB0DY5 to deal with close encounters. By the time NB0DY6 
had developed, the code included both neighbour scheme and regularizations, as well as applied with 
Hermite scheme integration method combined with hierarchical block time steps. 

NB0DY6++ is a descendant of the standard NBODY 6, it kept those features of its predecessor men¬ 
tioned above as well as increased the efficiency by redesigning the algorithms suitable for parallel hard¬ 
ware. NB0DY6++ used the SPMD (Single Program Multiple Data) scheme to achieve parallelism, in 
this mode multiple autonomous processors simultaneously start with chunked local data and then com¬ 
municate with each other through the copy algorithm (Makino 2002; Dorband et al. 2003), which is 
a parallelized algorithm assumes that each processor has a local copy of the whole system and every 
processor handles the subgroup of data itself then broadcasts the new data to all the other processors 
immediately. The parallelization scheme of NBODY6++ is implemented with the standard MPI library 
package. 

The most major improvement of NBODY6++ is the parallelization of regular and irregular force 
computations, which were special concepts introduced from Ahmad-Cohen neighbour scheme that di¬ 
vided the full force of each particle involved by other all particles into two parts: one part called irregular 
force that has a frequent but short time steps for interactions with adjacent particles, another one called 
regular force which has a longer time steps for full interactions. By assigning the most expensive over¬ 
head sections to multiple processors NBODY6++ achieved the expected efficiency. Moreover, the heavier 
regular force computation component was adapted for GPU acceleration using CUDA. The performance 
of parallel accelerations showed in Section 3.1. 

2.2 A-body tree-code implementation: Bonsai 

In this section we provide a brief description of Bonsai , which we used for the performance analysis 
of A-body tree-code. 

Tree-code algorithm as a widely used method nowadays for A-body simulations is originally in¬ 
troduced by Barnes &. Hut (1986). This algorithm reduces the computational complexity of A-body 
simulation from 0{N‘^) to 0{N log A), therefore improves the simulating scale compared to the brute 
force methods. Here Bonsai , a parallel GPU tree-code implementation developed by Jeroen Bedorf, 
Evghenii Gaburov and Simon Portegies Zwart (Bedorf et al. 2012a,b), is a suitable representative of 
gravitational A-body tree-code in recent years. 

Certain schemes are introduced in Bonsai to ensure the high efficiency of the code (Bedorf et al. 
2012a). A sparse octree is used as the data structures, which means the structure is the three-dimension 
extension of a binary tree where tree-cells are not complete and equal, it is based on the underlying par¬ 
ticle distribution. The tree is constructed layer-by-layer from top to bottom, and inverted the direction in 
traverse process. Tree-cell properties are updated during the steps, and the integration of the simulation 
are advanced. The depth of tree traversing affects both accuracy and time consumption crucially, which 
is determined by the multipole acceptance criterion (MAC) in the tree-code. The criterion is described 
as follows, 

d> — + 5 (1) 

where d is the smallest distance between a group and the cell’s center-of-mass, / is the length of the 
cell, 5 is the distance between the cells center-of-geometry and the center-of-mass, 0 is an opening 
angle parameter to control the accuracy. If the inequality is satisfied then traversing process will be 
interrupted while multipole moment will be used. 

Like other existing A-body codes. Bonsai uses parallel technique to reach large scale or high 
resolution simulations, and applies GPUs to speed up force computations. In contrast to those GPU 
tree-code. Bonsai executes all parts of the algorithm on GPUs, to avoid the bottlenecks generated from 
CPU-GPU communications. In Section 3.2 the performance of main parts in the code are presented. 
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2.3 Hardware environments and initial conditions 

The supercomputer we mainly used for all the tests of both software presented above is an IBM iDat- 
aPlex Cluster JUDGE named “The Milky Way System” partition provided and maintained by Jiilich 
Supercomputing Centre in Germany, which is a dedicated GPU cluster using 2 Intel Xeon X5650 6-core 
processors and 2 NVIDIA Tesla M2050/M2070 GPU cards in every node, with 206 compute nodes and 
239 Teraflops peak performance in total. 

Our performance measurements involved two different kinds of parallel gravitational GPU- 
accelerated V-body codes: NBODY6++ and Bonsai . The initial conditions of all tests of both 
codes are consistent with each other, starting with Plummer model and running over standard 1 N- 
body Time Unit. The number of particles used ranges from N = 2^^ (8k) to 2^°(1M), doubled the 
number over the interval successively. There are additional tests using larger particle numbers up to 
N = 2^^(16M) in Bonsai code runs. The number of processors we chose is the series increasing 
number Np = 1, 2,4, 8,16, 32. Other parameters which are necessary but specific only in each code, 
such as time step factor for regular/irregular force polynomial and desired optimal neighbour number 
in NBODY6++ , or accuracy control parameter 9 and softening value e in Bonsai , will be described 
detailed in Section 3. 


3 PERFORMANCE 

In this section we evaluated the performance of these two GPU-based parallel V-body simulation codes 
(i.e. NBODY6++ and Bonsai ) which we tested mainly on the Jiilich Dedicated GPU Environment 
described in Section 2.3. 

In spite of a vast difference between two codes derived from their own fundamental algorithms and 
specific details, which make it difficult to give a one-to-one comparison, there are some global values 
providing sufficient information. Timing variables, speed-up and hardware performance indicators like 
speed and bandwidth were measured below for performance analysis of codes. 

3.1 Performance of NBODY6-t"l- 

NBODY6++ , a parallel direct gravitational V-body code, is featured with a couple of elegant algorithms 
and schemes developed and maintained over the past few decades. The procedures of durative amelio¬ 
rations enabled more realistic size of simulations running in achievable circumstances while increased 
sophistication as well. As a consequence we present a performance model for analysing the overall be¬ 
haviour as well as main components of the code. Through this model we will have a better idea about 
the performance of a typical direct V-body code and predictions about the code behaviour further in 
larger scales. 


3.1.1 Performance model 


We measured running time directly for evaluating performance and modelling. In NBODY6++ , the total 
wall-clock time Ttotai required to advance the simulation for a certain integration interval can be written 
as 


Ttot — Eforce + Ta 


- Thost 


( 2 ) 


where Tforce = Treg + Tirr + Tpre is time spend on both host and device involving force calculations, 
here Tleg, Ti„ and Tp^e are time spend on force computations of regular time step, irregular time step 
respectively and prediction; Tcomm = Tmov + 2mci + Tmcr + Tsyn is time spend on data moving for 
parallel components, MPI communications after regular and irregular blocks and synchronizations of 
processors; Thost is time spend on host side which are absolutely sequential runs. All of the time vari¬ 
ables are measured directly by standard Eortran functions ETIME in sequential mode and MPI_WTIME 
in parallel mode. The entire descriptions are gathered in the last glossary (Table 3). 
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Fig.l Total wall-clock time (Ttot) of NB0DY6++ as a function of & Np. Solid lines are 
the measured values of running time, dashed lines are the ideal acceleration by increasing 
processor numbers. (The unit symbols in the legend have the magnitudes: Ik = 1, 024, IM = 
Ik^ and IG = Ik^, similarly hereinafter.) 


Table 1 Main components of NBODY 6++ 


Timing Expected scaling 


Description 

variable. 

N 

Np 



Fitting value [sec] 


Regular force computation 

Treg 

0{N,^e ■ N) 

0{N-^) 

(2.2 

• 10’ 

-y. 

~f 10.43) ■ 


Irregular force computation 

2irr 

G(iVirr • {Nr,b}) 

0{N-^) 

(3.9 

• 10 

-7. jvi. 

^® - 16.47) • 

A-i 

Prediction 

Tpre 


G(iVp“ 

(1.2 

• 10' 

-6 . N^- 

- 3.58) • N-°-^ 

Data moving 

Ikiov 


0(1) 


2.5 

■ 10"® • 

-0.28 

MPI communication (regular) 

Tmcr 


0{kpcr ■ \ 

) (3.3- 10“® 

.iVi-18 

-t0.12)(1.5 ■ 

iVp— i N 

iVp J 

MPI communication (irregular) 

Tmci 


OikPc. • ^ 

(3.6- 10-’’ 

. iVl-40 

-t0.56)(1.5 ■ 

Np-1. 
Np ) 

Synchronization 

Tsyn 



(4. 

1 • 10"® • N 

-t 0.07) • 

Np 

Sequential parts on host 

Thost 


0(1) 


4.4 

• 10"^ • 

-t 1.23 



Notes: Detailed descriptions of used symbol gathered in Table 3. 


According to the decomposition described above we broke down the code structure and measured 
these main sections which have heavy weights in code. Owing to a large amount of variables and the 
high complexities some insignificant components in NBODY6++ are not counted in. For every parts 
to be analysed we listed the expected scaling and optimal fitting value in Table 1, which are got¬ 
ten from the structure of code implementation, chronograph and fitting functions. A python function 
scipy. optimize . curve_f it are used to obtain the optimal fitting value, which based on non¬ 
linear least squares. 
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Fig. 2 The speed-up (S) of NBODY 6-1--1- as a function of N & Np. Solid lines are the measured 
speed-up ratio between sequential and parallel wall-clock time, dashed lines are the predicted 
performance of larger scale simulations. 


The conception speed-up is used for evaluating the parallelism of the code. There are a couple of 
definitions of speed-up with different ranges. The ideal maximum speed-up Si = Np will never be 
accessible, where Np is the number of processors used. Unreachable as well, but a more reasonable 
indicator to predict the theoretical maximum speed-up is so-caWed Amdahl’s law, which is defined as 


Sa{Np) 


m) 

TiNp) 




(3) 


where X is the fraction of the algorithm that can benefit from parallelization. In practice there is another 
experiential speed-up to be measured through timer recording, which is given by 


Se{Np) 


Ttot(l) 

TtotiNp) 


(4) 


where Ttot(l) & Ttot(./Vp) are both the measured values of actual running time. By combining the fitting 
values into the speed-up formula we will have a general overall perception of the code, and by which 
make a prediction accordingly about the code performance in larger scale simulations. 

The speed of force calculation is measured by the extent at which program reaches the peak of 
computing devices. Here in our tests the computing device particularly refers to the NVIDIA Tesla 
M2050/M2070 GPU cards, which feature up to 1,030 Gigaflops of single precision floating point per¬ 
formance and 515 Gigaflops of double precision floating point performance per card. 

In NBODY 6 ++ , as the total force are divided into two parts we used two speed variables P^eg and 
Pirr represent regular and irregular force calculating speed, which are written as 

_ -^reg.tot _ ^reg * ^ * Th4 p _ .^irr.tot _ ‘ {^nh} * Th4 

p-eg — , Prr — — Tj; 

1 peg reg irr irr 


( 5 ) 
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Fig.3 Hardware performanceofNB0DY6++running on “MilkyWay” GPU cluster. The upper 
panel corresponds to the regular force computation speed (Preg), where two dashed lines 
refer to the peak single and double precision floating point performance. The lower panel 
corresponds to the bandwidth of regular part (Preg)- 


where iV[reg|irr].tot is the total floating point operations of regular/irrgular force computations, At[reg|irr] 
is the cumulative number of regular/irrgular time steps, {N^h) is the average number of integrated 
“neighbour” particles, and 7 h 4 defines the floating point operations counts of 4*^ Hermite scheme per 
particle per interaction per step, from the work Nitadori & Makino (2008) which is a constant value has 
7h4 = 60. 

Bandwidth is measured as a part of hardware performance along with computing speed (P). In 
NBODY6++ , we defined bandwidth (Preg, Pirr) as 


Preg — 






8 ■ (41 + Ima.) ■ N/Np 

Tmr.r 


iV„,ei 8 • 19 • (iVact)/iVp 


T ■ T 

J- niri m 


( 6 ) 


where /V[„icr|mci] is the number of bytes transferred during MPI communication after regular/irregular 
blocks, constant terms derived from the size of datasets transferred, in which Imax is the maximum size 
of neighbour lists set manually. Detailed results of all these performance indicators presented in the next 
section. 


3.1.2 Performance results 

The measured total wall-clock time of NBODY6++ is shown in Figure 1 . 

On the whole, the result shows a good extensibility and acceleration when using more processors 
numbers. To be specific, we assign each part of the code with different weight. Among all of the parts the 
time spend for force computations always has the highest value, therefore both regular, irregular force 
computations and prediction have been implemented with parallel algorithm and decrease rapidly when 
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code running in multi-processors. Here in the heaviest part Tforce, the regular force computation Tj-eg, 
which takes the highest fraction of computing time in the former code versions, has been accelerated 
and implemented on the specific device (GPU), as a consequence of causing a significant reduction of 
the whole running time costs. Other parts are currently executed on CPU side. 

Table 1 shows the Main components of NB0DY6++ as a function of N and Np. As describing above 
for every part expected scaling is evaluated by code structure, and fitting value is based on experimental 
data. The fitting process includes two steps by fitting N and Np successively but independently. Firstly 
we used a minimum of fixed Np to avoid the disturbance of processor number and obtained the ex¬ 
perimental scaling of N. The fitting values with N are obtained under circumstances which using an 
increasing particle numbers and fixed single processor (Np = 1), while for cases of multiprocessor- 
related value(i.e. T^^i, Tmcr, ^syn, which have no numbers in single processor runs) the number of 
processors changed to Np = 2. The fitting values with Np are obtained by the second step. Grouping 
the dataset according to N, then dividing every group data by each N dependent function to get the 
fitting values with Np. The fitting results of every main parts listed in the last column. Considering the 
expected scaling value of main parts, as Tmov, ^syn and Thost have no significant and direct scaling with 
N from code structure while Tpj-e makes up of two prediction branches that determined by N in next 
time step, we expect these values as a simple exponent form of N. N^eg, Ni„ and which used in 

Treg and Tirr are values which are completely dependent with N as iVreg cx: Ni^r oc and 

(Nnb) OC then in the fitting values they are combined together also as an exponent form of N. At 

last, an unified exponent form of N and Np are used in the last column rather than other symbols used 
in the middle column. 

By taking the fitting values into the definition of experimental speed-up, we give the prediction 
about the performance of NBODY 6++ , which is shown in Figure 2. As a result, the optimal value of Np 
needed for larger simulations of different scales showed clearly from the figure. 

Figure 3 shows the hardware performance of NBODY 6++ in the actual environment on a real GPU 
accelerated cluster. Because GPUs played as the central role in acceleration we focus on GPU relevant 
parts in NBODY 6++ , Preg and i?reg were drawn in the figure. For the figure of Preg, two dashed lines i.e. 
peak single and double precision floating point performance are used as the baseline, and the computa¬ 
tion speed of different group of N runs are increasingly closer to the peak when N doubled. In the whole 
NB0DY6++ data structures there is two type precisions used in respective parts. Double precision type is 
used in main loops of the code which declared as the type “REAL*8” in global header file, while for the 
regular force computation part which accelerated by GPU all of the data convert to the type “REAL*4” 
then single precision is used in all relevant parts of CUBA routine. Therefore a mixing precision data 
structure is used in the regular part of NBODY 6++ - double precision in data moving process and single 
precision in data computation process. As the computation process in GPU card dominating the regular 
part, we use single precision to make a comparison. As shown in the figure, the force computation speed 
of large N runs passed over half of the maximum single precision performance (for instance, P^eg for 
IM particle runs have the values 530 ~ 570 Gflops per M2050/M2070 GPU card.) This proportion 
concurs with the results of Berczik et al. (2011, 2013), which claimed the speed performance of another 
direct A-body code (/)-GPU getting the values oc 550 Gflops per C2050 GPU card and oc 1.48 Tflops per 
K20 GPU card, both results approached half of the single precision performance peak. Considering the 
hardware architecture, as operations among various register and memory causes extra inevitable time 
consumption, the proportion is acceptable in practical environments. For the figure of B^eg ignoring 
the “dropping” points, others remained with the level of more than 80% of the maximum bandwidth 
performance. 


3.2 Performance of Bonsai 

3.2.7 Performance model 


Similarly to Equation (2), in Bonsai the total wall-clock time Ttot can be written as 

Ttot — Ptree P Tforce T^comm Toother 


( 7 ) 
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Fig. 4 Total wall-clock time (Ttot) of Bonsai as a function of & Np. The legend is the 
same as Figure 1. Opening parameters of initial condition set as At = 0.0625, e = 0.01, 


9 = 0.5. 


where Ttree = Tsort + Tbuild + Tprop + Tgrp is time spend on tree building, which mainly included 
sorting and reordering of the particles along a ID number string mapped along Space Filling Curve, tree 
structure construction, tree-node properties computation and active groups setting for next steps; Tforce 
is time spend on force computations in tree-traverse; Tcomm = Tdom + Texch + ?syn is time spend on 
distributing or redistributing the particles interprocessor and synchronizations of processors; Tother = 
Tpre + Tcorr + ^ene IS time Consumptions for other mainly essential parts, like local-tree predictions 
before tree construction, corrections after force computations and energy check. All of the time variables 
are measured by CUBA C function cuEventElapsedTime from CUBA Event Management Driver 
API and counted on device (GPU) entirely. Trivial time consumptions on the host side is ignored. The 
entire descriptions are gathered in the last glossary (Table 3). 

In hardware performance aspect, differed from Equation (5) the force calculating speed in Bonsai 
are written as 


f^force — 


Nfo 


Nio 


•7t 


Tfo 


Tfo 


(8) 


where Afforce.tot is the total floating point operation counts; A^force is the cumulative number of interac¬ 
tions; 7 t is the number of operation counts for each interaction in tree-code, which we used a constant 
value has 7 t = 38 from the work Warren & Salmon (1992); Kawai et al. (1999); Hamada et al. (2009); 
Hamada & Nitadori (2010), and the result figure shows in Eigure 6. Note that Bedorf et al. (2014) used 
another separated operation counts 23 & 65 for particle-particle and particle-cell interactions respec¬ 
tively. 
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Table 2 Main components of Bonsai 


Description 

Timing 

variable 

Expected scaling 

N Np 

Fitting value [sec] 

Sorting and reordering 

Tsort 

0{N) 

0{Nf^) 

(1.5 • 10““ • N -k 2.45 • 10"'‘) • 

Tree construction 

Tbuild 

0{N) 

0{Nf^) 

(2.8 • 10“’^ • N + 2.06 • 10”^) • 

Node properties 

Tprop 

0{N) 

G(Ap-i) 

(9.1 • 10“® • N -k 5.78 • 10"®) • Ap-i 

Set active groups 

Tgrp 

0{N) 

0{Nf^) 

(1.7 • 10“® • N + 1.16 • 10"®) • iVp"i 

Force computation 

^?force 

OiNlogN) 

0{Np 

(2.5 • 10"® • MogA - 0.10) • Ap"° ®® 

Domain update 

Tdom 

G(MogA) 

G(l) 

5.4 • 10"®® • AlogA -k 2.96 • 10"® 

Exchange 

^?exch 

0{mogN) 

G(l) 

2.1 • 10"® • AlogA + 1.16 • 10"® 

Synchronization 

Tayn 

C>(A'“"“) 

Oikpsi ■ 

(1.4 • lO""® • A®-®® -k 9.3 • 10"®)(0.5 • A®-®®) 

Prediction 

Tpre 

0{N) 

G(iVp-i) 

(1.5 • 10"® • A + 1.49 • 10"®) • Ap"® 

Correction 

Tcorr 

0{N) 

G(Ap-i) 

(3.8 • 10"® • A + 7.88 • 10"'‘) • 

Energy check 

Tene 

0{N) 

0(N-^) 

(8.8 • 10"® • A + 7.14 • 10"^) • Ap"i 


Notes: Detailed descriptions of used symbol gathered in Table 3. 


3.2.2 Opening parameters in tree-code 

Three opening parameters play significantly important roles in tree-code running and affect the perfor¬ 
mance consequently as a different result. 

0: It is a dimension-less parameter defined in Equation ( 1 ) that controls the accuracy. Our test results 
showed that a smaller 9 makes the running time increasing shapely, then stopped rising on a certain range 
(6 Ri 0.01 as an experimental value); while a bigger 9 {0 ^ 0.2 ^ 0.35 influenced by N) causes a less 
accuracy of simulations. 

e; The softening parameter e does not contribute to the running time, on the other hand an optimal 
e could lead to a best approach to the minimum error. For a too small softening the estimates of the 
forces will be too noisy, while for a too large softening the force estimates will be misrepresented 
systematically, in between there is an optimal softening. The optimal e depends both on the number 
of particles and the size of time steps. From the work Athanassoula et al. (2000) when e has different 
minimum values the conditions of simulations are not the same, there is a relationship between N and 
optimal e. Through the comparison of their conclusions with groups of our test results of Bonsai code 
(using /S.E instead of MASE, 9 = 0.5; Af = 0.0625), we conclusion that the value of e what leads to a 
minimum AE is consistent with conclusions of the reference. 

At: The value of At affect both running time and energy error. On the running time side, At has a 
noticeable linearly dependent with time as: t oc 1/At; on the relative energy error side, the tests results 
are more complex. Our results shows that AE^nin varies sensitively under the chosen of combinations 
of At and e, as well as the different N. 

3.2.3 Performance results 

The measured total wall-clock time of Bonsai is shown in Figure 4. 

In Figure 4 & 5 we do not use any data for small particle number and large number of GPU nodes, 
because the performance goes down and the GPU’s are not fully loaded in this regime. 

We decomposed Ttot into main components as described in Section 3.2.1. For every components we 
measured the running time separately, and obtain fitting formulae as a function of N and Np. The fitting 
procedures was the same as in the NBODY6++ part described in Section 3.1.1, the results are shown in 
Table 2. 

Figure 5 shows the experimental speed-up of Bonsai defined as Equation (4). Compared with 
the result of Figure 2, Bonsai has tendency to a lower peak but wider scope. Considering the weight 
factors of force computational part in Table 1 & 2 quantitative information of ascendant distinctness are 
revealed, while the different weight of communication part is the mainly determinant for descendant 
lines. 
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Fig.5 The speed-up (S) of Bonsai as a function of particle number N & Np. The legend is 
the same as Figure 2. Opening parameters of initial condition set as At = 0.0625, e = 0.01, 


9 = 0.5. 


Figure 6 shows the hardware performance of Bonsai in a practical environment on a real GPU 
accelerated cluster. The performance in floating point operations per second is given for the dominant 
part of the force computations only. The figure indicates that for the large enough N we get near half 
of the peak single precision of our M2050/M2070 GPU card and increasing steadily for large scale Np. 
This proportion is similar to the result of NBODY6++ code discussed in Section 3.1 . The utilization of 
the GPU is quite good considering the tree-code structure. 

4 DISCUSSION 

In this paper we analyze the performance of two very different kinds of A^-body codes, both pioneers 
in their fields and both heavily optimized for GPU acceleration and parallelization - NBODY6++ and 
Bonsai . There is always the question what is the turn-even point for the codes, how do they compare 
with each other. Due to the very different nature of the two codes such a comparison is inevitably unfair 
- NBODY6++ has few-body regularizations and is aimed for high accuracy of both near and more distant 
gravitational forces; Bonsai achieves optimal performance if the opening parameter 9 is relatively 
large, providing rather less accurate gravitational forces. But in certain ranges of parameters both codes 
may overlap in performance, accuracy and efficiency. It is the goal of this paper to provide a quantitative 
information about this. 

We do this with the help of the four panels in Figure 7 - they show wall-clock time and energy accu¬ 
racy as a function of the average time step; the main curves are for Bonsai as indicated in the caption, 
for two different opening parameters. However, also data for NBODY6++ are shown for comparison: 
wall-clock time and accuracy as a function of average time step. In addition, we show that for a fixed 
particle number the time step of Bonsai which results to the same wall-clock time as for NBODY6++ . 
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Fig. 6 Hardware performance of force computation speed (Pforce) of Bonsai running on 
“MilkyWay” GPU cluster. Two dashed lines in the figure refer to the peak single and double 
precision floating point performance. 


The following main conclusions can be drawn: at same wall-clock time and same particle number 
(and 9 = 0.2) Bonsai runs typically with time steps of a factor of 10 — 50 larger. In other words 
NBODY 6 ++ provides a much smaller time step and a factor of 10 better accuracy (see lower panels of 
Figure 7). Here energy error are used as the criterion to compare the accuracy. In our case the time 
evolution of the energy error contain two main parts. One part comes from the machine accuracy of the 
potential and force calculations. This is in our cases close to the single precision machine accuracy of 
order 10“^ ~ 10“®. The other error component comes from the numerical integration process itself, 
which plays as the dominated role in total energy error. In the NBODY 6 ++ we are using the complex 
Hermite 4*^ order individual block time step integration combined with the Ahmad-Cohen neighbour 
scheme. We have chosen the time step parameter 77 of the Aarseth time step criteria (for regular and 
irregular time step, which values set as 0.02 in initial input files) such that the energy error keeps the level 
of 10“® 10“^. How the global energy error of our integrator in NBODY 6 +- 1 - behaves can be found in 

a comprehensive study by Makino (1991). In the case of the Bonsai , the code using the simple leap¬ 
frog integration scheme, which (for the reasonable computational speed to reach the 1 A-body Time 
Unit) have a average energy error in a level of 10“® ^ 10“®. Insofar we have not discovered anything 
unexpected; however Bonsai can reach surprisingly good accuracy in total energy (like 5 • 10“®) at 
wall-clock times comparable to NBODY 6 - 1 -+ . With a larger opening angle {9 = 0.5) the time step and 
wall-clock parameters approach each other more (factor two to three, for one million bodies). In such 
a case still there is a quite considerable energy accuracy of order 10“®. That levels of energy accuracy 
arguably may be sufficient even for collisional gravothermal systems, as it seems. 

However, the total energy conservation is not the only criterion to judge about the use of a code 
and its accuracy. In NBODY6++ close encounters and interactions of compact or hierarchical multiple 
systems are treated with regularization methods and zero softening, while Bonsai uses an artificial 
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Fig. 7 Comparisons of wall-clock time and relative energy error of NB0DY6++ and Bonsai 
as a function of At. Opening parameters of Bonsai set as e = 0.01, 9 = 0.5 in the left 
column, and smaller value e = 0.001, 9 = 0.2 as the control group in the right column. 
In every panel the left dashed line corresponds NBODY6++ benchmark data, and solid lines 
are Bonsai data. The diamond symbols indicates junctions of Bonsai which has the same 
running time as NBODY 6+-i- in the case of the same N. 


softening of the gravitational potential at small distances. Reasonable energy conservation refers to that 
artificial gravitational potential including softening, which is conservative as well, but not the true few- 
body potential. So the additional numerical efforts necessary for NBODY6++ goes on one hand side into 
the exact resolution of all kinds of close interactions below the softening length used in Bonsai . But 
also on the other side, the long-range interactions. Bonsai uses the standard Tree-code procedure of ap¬ 
proximating forces from groups of particles by forces from their centers of masses and multipoles. This 
feature needs to be tested by simulation of core collapsing star clusters, where long range gravitational 
interactions determine the global evolution, which is beyond the scope of this paper. 
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Table 3 Glossary 


Variable Description 

General 


N 

Np 

Sa 

Si 

Se 

P 

B 

Ttot 

kUj: , kpj: 

AE 

At 


total particle number 
number of processors 

theoretical maximum speed-up defined by Amdahl’s law 
ideal maximum speed-up equals to Np 

experiential speed-up equals to the ratio between measured time of single 
and multiple processor number 

force computation speed of floating point operations per second 
bandwidth of bytes of data transfer per second 
total wall-clock time 

quantitative factors for fitting result of certain parts; fc[n|p] implies the fac¬ 
tor only depends on N\Np, subscript x indicates different parts 
relative energy error 
time step interval of integration 


NB0DY6++ 


(Aact) average number of integrated active particles 

(Anb) average neighbour number 

Airr cumulative number of irregular time steps 

Areg cumulative number of regular time steps 

7 h 4 floating point operations counts per particle per interaction per step 

icomm sum of Communication time 

Tforce sum of force computation time 

Thost time spend on the host side 

Tirr neighbour (irregular) force computation time 

Tmci MPI communication after irregular blocks 

Tmcr MPI communication after regular blocks 

Tmov time spend on data moving for parallel runs 

Tpre particle prediction time 

Treg full (regular) force computation time 

Tsyn interprocessor synchronization time 

Bonsai 

Aforce cumulative number of interactions 

7 t floating point operations counts per particle per interaction per step 

ibuiid time spend on tree structure building 

Tcomm sum of Communication time 

Tcorr particle correction time 

Tdom time spend on update of particle domain 

Tene energy check time 

Texch time spend on particle exchange 

Tforce force computation time 

Tgrp time spend on setting active groups 

Tpre local tree prediction time 

Tprop node properties computation time 

Tsort sorting and data-reordering time 

Tsyn interprocessor synchronization time 

Ttree sum of the whole tree construction time 

e softening to diminish the effect of graininess 

6 opening angle to control the accuracy 
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